Learning Scikit-learn: Machine Learning in Python

IPython Notebook for Chapter 4: Advanced Features - Feature Engineering and Selection

The usual scenario for learning tasks such as those presented in this book includes a list of instances (represented as feature/value pairs) and a special feature (the target class) that we want to predict for future instances based on the values of the remaining features. However, the source data does not usually come in this format. We have to extract what we think are potentially useful features and convert them to our learning format. This process is called feature extraction or feature engineering, and it is an often underestimated but very important and time-consuming phase in most real-world machine learning tasks.

Start by importing numpy, scikit-learn, pandas, and pyplot, the Python libraries we will be using in this chapter. Show the versions we will be using (in case you have problems running the notebooks).


In [1]:
%pylab inline
import IPython
import sklearn as sk
import numpy as np
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt

print 'IPython version:', IPython.__version__
print 'numpy version:', np.__version__
print 'scikit-learn version:', sk.__version__
print 'matplotlib version:', matplotlib.__version__
print 'pandas version:', pd.__version__


Populating the interactive namespace from numpy and matplotlib
IPython version: 2.1.0
numpy version: 1.8.2
scikit-learn version: 0.15.1
matplotlib version: 1.3.1
pandas version: 0.14.1

Import Titanic data using pandas

The Python package pandas (http://pandas.pydata.org/) provides data structures and tools for data analysis. It aims to provide similar features to those of R, the popular language and environment for statistical computing. We will use pandas to import the Titanic data we presented in Chapter 2, Supervised Learning, and convert it to the scikit-learn format.


In [2]:
titanic = pd.read_csv('data/titanic.csv')
print titanic


      row.names pclass  survived  \
0             1    1st         1   
1             2    1st         0   
2             3    1st         0   
3             4    1st         0   
4             5    1st         1   
5             6    1st         1   
6             7    1st         1   
7             8    1st         0   
8             9    1st         1   
9            10    1st         0   
10           11    1st         0   
11           12    1st         1   
12           13    1st         1   
13           14    1st         1   
14           15    1st         0   
15           16    1st         1   
16           17    1st         0   
17           18    1st         0   
18           19    1st         1   
19           20    1st         1   
20           21    1st         1   
21           22    1st         0   
22           23    1st         1   
23           24    1st         1   
24           25    1st         1   
25           26    1st         0   
26           27    1st         1   
27           28    1st         1   
28           29    1st         1   
29           30    1st         0   
...         ...    ...       ...   
1283       1284    3rd         0   
1284       1285    3rd         0   
1285       1286    3rd         0   
1286       1287    3rd         0   
1287       1288    3rd         0   
1288       1289    3rd         0   
1289       1290    3rd         1   
1290       1291    3rd         0   
1291       1292    3rd         0   
1292       1293    3rd         0   
1293       1294    3rd         1   
1294       1295    3rd         0   
1295       1296    3rd         0   
1296       1297    3rd         0   
1297       1298    3rd         0   
1298       1299    3rd         0   
1299       1300    3rd         0   
1300       1301    3rd         0   
1301       1302    3rd         0   
1302       1303    3rd         1   
1303       1304    3rd         0   
1304       1305    3rd         1   
1305       1306    3rd         0   
1306       1307    3rd         0   
1307       1308    3rd         0   
1308       1309    3rd         0   
1309       1310    3rd         0   
1310       1311    3rd         0   
1311       1312    3rd         0   
1312       1313    3rd         0   

                                                  name      age     embarked  \
0                         Allen, Miss Elisabeth Walton  29.0000  Southampton   
1                          Allison, Miss Helen Loraine   2.0000  Southampton   
2                  Allison, Mr Hudson Joshua Creighton  30.0000  Southampton   
3      Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  25.0000  Southampton   
4                        Allison, Master Hudson Trevor   0.9167  Southampton   
5                                   Anderson, Mr Harry  47.0000  Southampton   
6                     Andrews, Miss Kornelia Theodosia  63.0000  Southampton   
7                               Andrews, Mr Thomas, jr  39.0000  Southampton   
8         Appleton, Mrs Edward Dale (Charlotte Lamson)  58.0000  Southampton   
9                               Artagaveytia, Mr Ramon  71.0000    Cherbourg   
10                           Astor, Colonel John Jacob  47.0000    Cherbourg   
11    Astor, Mrs John Jacob (Madeleine Talmadge Force)  19.0000    Cherbourg   
12                        Aubert, Mrs Leontine Pauline      NaN    Cherbourg   
13                           Barkworth, Mr Algernon H.      NaN  Southampton   
14                                 Baumann, Mr John D.      NaN  Southampton   
15      Baxter, Mrs James (Helene DeLaudeniere Chaput)  50.0000    Cherbourg   
16                             Baxter, Mr Quigg Edmond  24.0000    Cherbourg   
17                                 Beattie, Mr Thomson  36.0000    Cherbourg   
18                        Beckwith, Mr Richard Leonard  37.0000  Southampton   
19     Beckwith, Mrs Richard Leonard (Sallie Monypeny)  47.0000  Southampton   
20                                Behr, Mr Karl Howell  26.0000    Cherbourg   
21                                  Birnbaum, Mr Jakob  25.0000    Cherbourg   
22                             Bishop, Mr Dickinson H.  25.0000    Cherbourg   
23             Bishop, Mrs Dickinson H. (Helen Walton)  19.0000    Cherbourg   
24             Bjornstrm-Steffansson, Mr Mauritz Hakan  28.0000  Southampton   
25                         Blackwell, Mr Stephen Weart  45.0000  Southampton   
26                                     Blank, Mr Henry  39.0000    Cherbourg   
27                              Bonnell, Miss Caroline  30.0000  Southampton   
28                             Bonnell, Miss Elizabeth  58.0000  Southampton   
29                             Borebank, Mr John James      NaN  Southampton   
...                                                ...      ...          ...   
1283               Vestrom, Miss Hulda Amanda Adolfina      NaN          NaN   
1284                                    Vonk, Mr Jenko      NaN          NaN   
1285                                Ware, Mr Frederick      NaN          NaN   
1286                        Warren, Mr Charles William      NaN          NaN   
1287                                  Wazli, Mr Yousif      NaN          NaN   
1288                                  Webber, Mr James      NaN          NaN   
1289                     Wennerstrom, Mr August Edvard      NaN          NaN   
1290                                Wenzel, Mr Linhart      NaN          NaN   
1291                        Widegren, Mr Charles Peter      NaN          NaN   
1292                          Wiklund, Mr Jacob Alfred      NaN          NaN   
1293                                 Wilkes, Mrs Ellen      NaN          NaN   
1294                                  Willer, Mr Aaron      NaN          NaN   
1295                                 Willey, Mr Edward      NaN          NaN   
1296                          Williams, Mr Howard Hugh      NaN          NaN   
1297                               Williams, Mr Leslie      NaN          NaN   
1298                                Windelov, Mr Einar      NaN          NaN   
1299                                   Wirz, Mr Albert      NaN          NaN   
1300                             Wiseman, Mr Phillippe      NaN          NaN   
1301                           Wittevrongel, Mr Camiel      NaN          NaN   
1302                                 Yalsevac, Mr Ivan      NaN          NaN   
1303                                Yasbeck, Mr Antoni      NaN          NaN   
1304                               Yasbeck, Mrs Antoni      NaN          NaN   
1305                                Youssef, Mr Gerios      NaN          NaN   
1306                               Zabour, Miss Hileni      NaN          NaN   
1307                               Zabour, Miss Tamini      NaN          NaN   
1308                                Zakarian, Mr Artun      NaN          NaN   
1309                            Zakarian, Mr Maprieder      NaN          NaN   
1310                                   Zenn, Mr Philip      NaN          NaN   
1311                                     Zievens, Rene      NaN          NaN   
1312                                    Zimmerman, Leo      NaN          NaN   

                               home.dest     room             ticket   boat  \
0                           St Louis, MO      B-5         24160 L221      2   
1        Montreal, PQ / Chesterville, ON      C26                NaN    NaN   
2        Montreal, PQ / Chesterville, ON      C26                NaN  (135)   
3        Montreal, PQ / Chesterville, ON      C26                NaN    NaN   
4        Montreal, PQ / Chesterville, ON      C22                NaN     11   
5                           New York, NY     E-12                NaN      3   
6                             Hudson, NY      D-7          13502 L77     10   
7                            Belfast, NI     A-36                NaN    NaN   
8                    Bayside, Queens, NY    C-101                NaN      2   
9                    Montevideo, Uruguay      NaN                NaN   (22)   
10                          New York, NY      NaN  17754 L224 10s 6d  (124)   
11                          New York, NY      NaN  17754 L224 10s 6d      4   
12                         Paris, France     B-35       17477 L69 6s      9   
13                         Hessle, Yorks     A-23                NaN      B   
14                          New York, NY      NaN                NaN    NaN   
15                          Montreal, PQ  B-58/60                NaN      6   
16                          Montreal, PQ  B-58/60                NaN    NaN   
17                          Winnipeg, MN      C-6                NaN    NaN   
18                          New York, NY     D-35                NaN      5   
19                          New York, NY     D-35                NaN      5   
20                          New York, NY    C-148                NaN      5   
21                     San Francisco, CA      NaN                NaN  (148)   
22                          Dowagiac, MI     B-49                NaN      7   
23                          Dowagiac, MI     B-49                NaN      7   
24    Stockholm, Sweden / Washington, DC      NaN                         D   
25                           Trenton, NJ      NaN                NaN  (241)   
26                        Glen Ridge, NJ     A-31                NaN      7   
27                        Youngstown, OH      C-7                NaN      8   
28     Birkdale, England Cleveland, Ohio    C-103                NaN      8   
29                 London / Winnipeg, MB   D-21/2                NaN    NaN   
...                                  ...      ...                ...    ...   
1283                                 NaN      NaN                NaN    NaN   
1284                                 NaN      NaN                NaN    NaN   
1285                                 NaN      NaN                NaN    NaN   
1286                                 NaN      NaN                NaN    NaN   
1287                                 NaN      NaN                NaN    NaN   
1288                                 NaN      NaN                NaN    NaN   
1289                                 NaN      NaN                NaN    NaN   
1290                                 NaN      NaN                NaN    NaN   
1291                                 NaN      NaN                NaN    NaN   
1292                                 NaN      NaN                NaN    NaN   
1293                                 NaN      NaN                NaN    NaN   
1294                                 NaN      NaN                NaN    NaN   
1295                                 NaN      NaN                NaN    NaN   
1296                                 NaN      NaN                NaN    NaN   
1297                                 NaN      NaN                NaN    NaN   
1298                                 NaN      NaN                NaN    NaN   
1299                                 NaN      NaN                NaN    NaN   
1300                                 NaN      NaN                NaN    NaN   
1301                                 NaN      NaN                NaN    NaN   
1302                                 NaN      NaN                NaN    NaN   
1303                                 NaN      NaN                NaN    NaN   
1304                                 NaN      NaN                NaN    NaN   
1305                                 NaN      NaN                NaN    NaN   
1306                                 NaN      NaN                NaN    NaN   
1307                                 NaN      NaN                NaN    NaN   
1308                                 NaN      NaN                NaN    NaN   
1309                                 NaN      NaN                NaN    NaN   
1310                                 NaN      NaN                NaN    NaN   
1311                                 NaN      NaN                NaN    NaN   
1312                                 NaN      NaN                NaN    NaN   

         sex  
0     female  
1     female  
2       male  
3     female  
4       male  
5       male  
6     female  
7       male  
8     female  
9       male  
10      male  
11    female  
12    female  
13      male  
14      male  
15    female  
16      male  
17      male  
18      male  
19    female  
20      male  
21      male  
22      male  
23    female  
24      male  
25      male  
26      male  
27    female  
28    female  
29      male  
...      ...  
1283  female  
1284    male  
1285    male  
1286    male  
1287    male  
1288    male  
1289    male  
1290    male  
1291    male  
1292    male  
1293  female  
1294    male  
1295    male  
1296    male  
1297    male  
1298    male  
1299    male  
1300    male  
1301    male  
1302    male  
1303    male  
1304  female  
1305    male  
1306  female  
1307  female  
1308    male  
1309    male  
1310    male  
1311  female  
1312    male  

[1313 rows x 11 columns]

You can see that each CSV column has a corresponding feature in the DataFrame, and that the feature type is inferred from the available data. We can inspect some features to see what they look like.


In [3]:
print titanic.head()[['pclass', 'survived', 'age', 'embarked', 'boat', 'sex']]


  pclass  survived      age     embarked   boat     sex
0    1st         1  29.0000  Southampton      2  female
1    1st         0   2.0000  Southampton    NaN  female
2    1st         0  30.0000  Southampton  (135)    male
3    1st         0  25.0000  Southampton    NaN  female
4    1st         1   0.9167  Southampton     11    male

In [4]:
titanic.describe()


Out[4]:
         row.names     survived         age
count  1313.000000  1313.000000  633.000000
mean    657.000000     0.341965   31.194181
std     379.174762     0.474549   14.747525
min       1.000000     0.000000    0.166700
25%     329.000000     0.000000   21.000000
50%     657.000000     0.000000   30.000000
75%     985.000000     1.000000   41.000000
max    1313.000000     1.000000   71.000000
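
The two inspection steps above (type inference on load, then describe()) can be reproduced on a tiny inline sample. This is only a sketch: the three rows are made up for illustration, and the code assumes Python 3 and a recent pandas rather than the Python 2 environment used in the notebook cells.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row sample laid out like a few Titanic columns;
# the second row has no age
csv_text = StringIO(
    "pclass,survived,age\n"
    "1st,1,29.0\n"
    "1st,0,\n"
    "3rd,0,30.0\n"
)
sample = pd.read_csv(csv_text)

# Column types are inferred from the values found in each column
print(sample.dtypes)

# describe() summarizes the numeric columns; 'count' skips missing values
summary = sample.describe()
print(summary)
```

In the sample, the count for age is lower than the count for survived, which is the same signal that reveals the missing ages in the real data (633 ages for 1313 rows).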

Feature extraction

The main difficulty we have now is that scikit-learn methods expect real numbers as feature values. In Chapter 2, Supervised Learning, we used the LabelEncoder and OneHotEncoder preprocessing methods to manually convert certain categorical features into 1-of-K values (generating a new feature for each possible value, valued 1 if the original feature had the corresponding value and 0 otherwise). This time, we will use a similar scikit-learn method, DictVectorizer, which automatically builds these features from the different original feature values. Moreover, we will write a function that encodes a set of columns in a single step.


In [5]:
from sklearn import feature_extraction

def one_hot_dataframe(data, cols, replace=False):
    """ Takes a dataframe and a list of columns that need to be encoded.
    Returns a 2-tuple comprising the data (with the encoded columns
    joined in when replace is True) and the vectorized data.
    Modified from https://gist.github.com/kljensen/5452382
    """
    vec = feature_extraction.DictVectorizer()

    # Build one dictionary per row and let DictVectorizer create a
    # 'column=value' feature for every distinct string value.
    # outtype='records' is the keyword in the pandas version used here;
    # later pandas releases call this keyword orient.
    vecData = pd.DataFrame(vec.fit_transform(data[cols].to_dict(outtype='records')).toarray())
    vecData.columns = vec.get_feature_names()
    vecData.index = data.index
    if replace is True:
        # Drop the original columns and append the encoded ones
        data = data.drop(cols, axis=1)
        data = data.join(vecData)
    return (data, vecData)

titanic, titanic_n = one_hot_dataframe(titanic, ['pclass', 'embarked', 'sex'], replace=True)
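
To see the 1-of-K encoding in isolation, here is a minimal sketch that feeds DictVectorizer a few hand-written rows. The rows are hypothetical, and the code assumes Python 3 with a recent scikit-learn; it reads the fitted feature_names_ attribute instead of calling a getter method.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows: one dictionary per instance, string-valued features
rows = [
    {'pclass': '1st', 'sex': 'female'},
    {'pclass': '3rd', 'sex': 'male'},
    {'pclass': '1st', 'sex': 'male'},
]

# sparse=False returns a dense NumPy array instead of a sparse matrix
vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(rows)

# One 'feature=value' column per distinct value, in sorted order
print(vec.feature_names_)
print(encoded)
```

Each row ends up with exactly one 1 among the pclass columns and one 1 among the sex columns, which is the pattern visible in the encoded Titanic columns.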

In [6]:
titanic.describe()


Out[6]:
         row.names     survived         age  embarked  \
count  1313.000000  1313.000000  633.000000       821
mean    657.000000     0.341965   31.194181         0
std     379.174762     0.474549   14.747525         0
min       1.000000     0.000000    0.166700         0
25%     329.000000     0.000000   21.000000         0
50%     657.000000     0.000000   30.000000         0
75%     985.000000     1.000000   41.000000         0
max    1313.000000     1.000000   71.000000         0

       embarked=Cherbourg  embarked=Queenstown  embarked=Southampton  \
count         1313.000000          1313.000000           1313.000000
mean             0.154608             0.034273              0.436405
std              0.361668             0.181998              0.496128
min              0.000000             0.000000              0.000000
25%              0.000000             0.000000              0.000000
50%              0.000000             0.000000              0.000000
75%              0.000000             0.000000              1.000000
max              1.000000             1.000000              1.000000

        pclass=1st   pclass=2nd   pclass=3rd  \
count  1313.000000  1313.000000  1313.000000
mean      0.245240     0.213252     0.541508
std       0.430393     0.409760     0.498464
min       0.000000     0.000000     0.000000
25%       0.000000     0.000000     0.000000
50%       0.000000     0.000000     1.000000
75%       0.000000     0.000000     1.000000
max       1.000000     1.000000     1.000000

        sex=female     sex=male
count  1313.000000  1313.000000
mean      0.352628     0.647372
std       0.477970     0.477970
min       0.000000     0.000000
25%       0.000000     0.000000
50%       0.000000     1.000000
75%       1.000000     1.000000
max       1.000000     1.000000

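What does the 'embarked' feature contain? It is a leftover numeric column: a missing embarked value (NaN) is not a string, so DictVectorizer keeps it under the plain 'embarked' name instead of creating an 'embarked=...' column.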


In [7]:
print titanic_n.head(5)
print titanic_n[titanic_n['embarked'] != 0].head()


   embarked  embarked=Cherbourg  embarked=Queenstown  embarked=Southampton  \
0         0                   0                    0                     1   
1         0                   0                    0                     1   
2         0                   0                    0                     1   
3         0                   0                    0                     1   
4         0                   0                    0                     1   

   pclass=1st  pclass=2nd  pclass=3rd  sex=female  sex=male  
0           1           0           0           1         0  
1           1           0           0           1         0  
2           1           0           0           0         1  
3           1           0           0           1         0  
4           1           0           0           0         1  
     embarked  embarked=Cherbourg  embarked=Queenstown  embarked=Southampton  \
62        NaN                   0                    0                     0   
165       NaN                   0                    0                     0   
195       NaN                   0                    0                     0   
196       NaN                   0                    0                     0   
229       NaN                   0                    0                     0   

     pclass=1st  pclass=2nd  pclass=3rd  sex=female  sex=male  
62            1           0           0           0         1  
165           1           0           0           0         1  
195           1           0           0           0         1  
196           1           0           0           0         1  
229           1           0           0           0         1  
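
The leftover column can be reproduced on two hand-written rows. This is a sketch under assumptions: the rows are hypothetical, and the code targets Python 3 with a recent scikit-learn and NumPy.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows: one known port and one missing value (NaN is a float)
rows = [
    {'embarked': 'Cherbourg'},
    {'embarked': np.nan},
]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(rows)

# String values become 'embarked=<value>' columns; the numeric NaN is
# stored as-is under the plain 'embarked' name
print(vec.feature_names_)
print(encoded)
```

The second row carries NaN in the plain 'embarked' column and zeros elsewhere, matching rows 62, 165, and the others printed above.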

Convert the remaining categorical features...


In [8]:
print titanic.head()
titanic, titanic_n = one_hot_dataframe(titanic, ['home.dest', 'room', 'ticket', 'boat'], replace=True)


   row.names  survived                                             name  \
0          1         1                     Allen, Miss Elisabeth Walton   
1          2         0                      Allison, Miss Helen Loraine   
2          3         0              Allison, Mr Hudson Joshua Creighton   
3          4         0  Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)   
4          5         1                    Allison, Master Hudson Trevor   

       age                        home.dest room      ticket   boat  embarked  \
0  29.0000                     St Louis, MO  B-5  24160 L221      2         0   
1   2.0000  Montreal, PQ / Chesterville, ON  C26         NaN    NaN         0   
2  30.0000  Montreal, PQ / Chesterville, ON  C26         NaN  (135)         0   
3  25.0000  Montreal, PQ / Chesterville, ON  C26         NaN    NaN         0   
4   0.9167  Montreal, PQ / Chesterville, ON  C22         NaN     11         0   

   embarked=Cherbourg  embarked=Queenstown  embarked=Southampton  pclass=1st  \
0                   0                    0                     1           1   
1                   0                    0                     1           1   
2                   0                    0                     1           1   
3                   0                    0                     1           1   
4                   0                    0                     1           1   

   pclass=2nd  pclass=3rd  sex=female  sex=male  
0           0           0           1         0  
1           0           0           1         0  
2           0           0           0         1  
3           0           0           1         0  
4           0           0           0         1  

We also have to deal with missing values, since the DecisionTreeClassifier we plan to use does not accept them as input. Pandas allows us to replace them with a fixed value using the fillna method. We will use the mean age for the age feature, and 0 for the remaining missing attributes.

Adjust N/A ages with the mean age


In [9]:
print titanic['age'].describe()
mean = titanic['age'].mean()
titanic['age'].fillna(mean, inplace=True)
print titanic['age'].describe()


count    633.000000
mean      31.194181
std       14.747525
min        0.166700
25%       21.000000
50%       30.000000
75%       41.000000
max       71.000000
dtype: float64
count    1313.000000
mean       31.194181
std        10.235540
min         0.166700
25%        30.000000
50%        31.194181
75%        31.194181
max        71.000000
dtype: float64
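
The effect of mean imputation can be checked on a short series. The values below are invented for illustration, and the sketch assumes Python 3 with a recent pandas; it uses the non-inplace form of fillna and assigns the result to a new name.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([20.0, np.nan, 40.0, np.nan])

# mean() skips NaN, so the mean is computed over the known ages only
filled = ages.fillna(ages.mean())

print(filled.tolist())
print(ages.std(), filled.std())
```

The mean is unchanged by the imputation, but the standard deviation shrinks because every filled value sits exactly at the mean. The same effect appears in the output above, where std drops from 14.75 to 10.24.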

Fill the remaining N/A values with zeros


In [10]:
titanic.fillna(0, inplace=True)
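
A quick way to confirm that no missing values remain is to count them per column before and after the fill. The frame below is a hypothetical stand-in with made-up column names and values, and the sketch assumes Python 3 with a recent pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with two missing cells
frame = pd.DataFrame({
    'age': [29.0, 31.0, 30.0],
    'feature_a': [0.0, np.nan, 0.0],
    'feature_b': [1.0, 0.0, np.nan],
})

# isnull().sum() counts the missing cells in each column
missing_before = frame.isnull().sum()
cleaned = frame.fillna(0)
missing_after = cleaned.isnull().sum()

print(missing_before)
print(missing_after)
```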

In [11]:
print titanic


      row.names  survived                                              name  \
0             1         1                      Allen, Miss Elisabeth Walton   
1             2         0                       Allison, Miss Helen Loraine   
2             3         0               Allison, Mr Hudson Joshua Creighton   
3             4         0   Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)   
4             5         1                     Allison, Master Hudson Trevor   
5             6         1                                Anderson, Mr Harry   
6             7         1                  Andrews, Miss Kornelia Theodosia   
7             8         0                            Andrews, Mr Thomas, jr   
8             9         1      Appleton, Mrs Edward Dale (Charlotte Lamson)   
9            10         0                            Artagaveytia, Mr Ramon   
10           11         0                         Astor, Colonel John Jacob   
11           12         1  Astor, Mrs John Jacob (Madeleine Talmadge Force)   
12           13         1                      Aubert, Mrs Leontine Pauline   
13           14         1                         Barkworth, Mr Algernon H.   
14           15         0                               Baumann, Mr John D.   
15           16         1    Baxter, Mrs James (Helene DeLaudeniere Chaput)   
16           17         0                           Baxter, Mr Quigg Edmond   
17           18         0                               Beattie, Mr Thomson   
18           19         1                      Beckwith, Mr Richard Leonard   
19           20         1   Beckwith, Mrs Richard Leonard (Sallie Monypeny)   
20           21         1                              Behr, Mr Karl Howell   
21           22         0                                Birnbaum, Mr Jakob   
22           23         1                           Bishop, Mr Dickinson H.   
23           24         1           Bishop, Mrs Dickinson H. (Helen Walton)   
24           25         1           Bjornstrm-Steffansson, Mr Mauritz Hakan   
25           26         0                       Blackwell, Mr Stephen Weart   
26           27         1                                   Blank, Mr Henry   
27           28         1                            Bonnell, Miss Caroline   
28           29         1                           Bonnell, Miss Elizabeth   
29           30         0                           Borebank, Mr John James   
...         ...       ...                                               ...   
1283       1284         0               Vestrom, Miss Hulda Amanda Adolfina   
1284       1285         0                                    Vonk, Mr Jenko   
1285       1286         0                                Ware, Mr Frederick   
1286       1287         0                        Warren, Mr Charles William   
1287       1288         0                                  Wazli, Mr Yousif   
1288       1289         0                                  Webber, Mr James   
1289       1290         1                     Wennerstrom, Mr August Edvard   
1290       1291         0                                Wenzel, Mr Linhart   
1291       1292         0                        Widegren, Mr Charles Peter   
1292       1293         0                          Wiklund, Mr Jacob Alfred   
1293       1294         1                                 Wilkes, Mrs Ellen   
1294       1295         0                                  Willer, Mr Aaron   
1295       1296         0                                 Willey, Mr Edward   
1296       1297         0                          Williams, Mr Howard Hugh   
1297       1298         0                               Williams, Mr Leslie   
1298       1299         0                                Windelov, Mr Einar   
1299       1300         0                                   Wirz, Mr Albert   
1300       1301         0                             Wiseman, Mr Phillippe   
1301       1302         0                           Wittevrongel, Mr Camiel   
1302       1303         1                                 Yalsevac, Mr Ivan   
1303       1304         0                                Yasbeck, Mr Antoni   
1304       1305         1                               Yasbeck, Mrs Antoni   
1305       1306         0                                Youssef, Mr Gerios   
1306       1307         0                               Zabour, Miss Hileni   
1307       1308         0                               Zabour, Miss Tamini   
1308       1309         0                                Zakarian, Mr Artun   
1309       1310         0                            Zakarian, Mr Maprieder   
1310       1311         0                                   Zenn, Mr Philip   
1311       1312         0                                     Zievens, Rene   
1312       1313         0                                    Zimmerman, Leo   

            age  embarked  embarked=Cherbourg  embarked=Queenstown  \
0     29.000000         0                   0                    0   
1      2.000000         0                   0                    0   
2     30.000000         0                   0                    0   
3     25.000000         0                   0                    0   
4      0.916700         0                   0                    0   
5     47.000000         0                   0                    0   
6     63.000000         0                   0                    0   
7     39.000000         0                   0                    0   
8     58.000000         0                   0                    0   
9     71.000000         0                   1                    0   
10    47.000000         0                   1                    0   
11    19.000000         0                   1                    0   
12    31.194181         0                   1                    0   
13    31.194181         0                   0                    0   
14    31.194181         0                   0                    0   
15    50.000000         0                   1                    0   
16    24.000000         0                   1                    0   
17    36.000000         0                   1                    0   
18    37.000000         0                   0                    0   
19    47.000000         0                   0                    0   
20    26.000000         0                   1                    0   
21    25.000000         0                   1                    0   
22    25.000000         0                   1                    0   
23    19.000000         0                   1                    0   
24    28.000000         0                   0                    0   
25    45.000000         0                   0                    0   
26    39.000000         0                   1                    0   
27    30.000000         0                   0                    0   
28    58.000000         0                   0                    0   
29    31.194181         0                   0                    0   
...         ...       ...                 ...                  ...   
1283  31.194181         0                   0                    0   
1284  31.194181         0                   0                    0   
1285  31.194181         0                   0                    0   
1286  31.194181         0                   0                    0   
1287  31.194181         0                   0                    0   
1288  31.194181         0                   0                    0   
1289  31.194181         0                   0                    0   
1290  31.194181         0                   0                    0   
1291  31.194181         0                   0                    0   
1292  31.194181         0                   0                    0   
1293  31.194181         0                   0                    0   
1294  31.194181         0                   0                    0   
1295  31.194181         0                   0                    0   
1296  31.194181         0                   0                    0   
1297  31.194181         0                   0                    0   
1298  31.194181         0                   0                    0   
1299  31.194181         0                   0                    0   
1300  31.194181         0                   0                    0   
1301  31.194181         0                   0                    0   
1302  31.194181         0                   0                    0   
1303  31.194181         0                   0                    0   
1304  31.194181         0                   0                    0   
1305  31.194181         0                   0                    0   
1306  31.194181         0                   0                    0   
1307  31.194181         0                   0                    0   
1308  31.194181         0                   0                    0   
1309  31.194181         0                   0                    0   
1310  31.194181         0                   0                    0   
1311  31.194181         0                   0                    0   
1312  31.194181         0                   0                    0   

      embarked=Southampton  pclass=1st  pclass=2nd     ...      \
0                        1           1           0     ...       
1                        1           1           0     ...       
2                        1           1           0     ...       
3                        1           1           0     ...       
4                        1           1           0     ...       
5                        1           1           0     ...       
6                        1           1           0     ...       
7                        1           1           0     ...       
8                        1           1           0     ...       
9                        0           1           0     ...       
10                       0           1           0     ...       
11                       0           1           0     ...       
12                       0           1           0     ...       
13                       1           1           0     ...       
14                       1           1           0     ...       
15                       0           1           0     ...       
16                       0           1           0     ...       
17                       0           1           0     ...       
18                       1           1           0     ...       
19                       1           1           0     ...       
20                       0           1           0     ...       
21                       0           1           0     ...       
22                       0           1           0     ...       
23                       0           1           0     ...       
24                       1           1           0     ...       
25                       1           1           0     ...       
26                       0           1           0     ...       
27                       1           1           0     ...       
28                       1           1           0     ...       
29                       1           1           0     ...       
...                    ...         ...         ...     ...       
1283                     0           0           0     ...       
1284                     0           0           0     ...       
1285                     0           0           0     ...       
1286                     0           0           0     ...       
1287                     0           0           0     ...       
1288                     0           0           0     ...       
1289                     0           0           0     ...       
1290                     0           0           0     ...       
1291                     0           0           0     ...       
1292                     0           0           0     ...       
1293                     0           0           0     ...       
1294                     0           0           0     ...       
1295                     0           0           0     ...       
1296                     0           0           0     ...       
1297                     0           0           0     ...       
1298                     0           0           0     ...       
1299                     0           0           0     ...       
1300                     0           0           0     ...       
1301                     0           0           0     ...       
1302                     0           0           0     ...       
1303                     0           0           0     ...       
1304                     0           0           0     ...       
1305                     0           0           0     ...       
1306                     0           0           0     ...       
1307                     0           0           0     ...       
1308                     0           0           0     ...       
1309                     0           0           0     ...       
1310                     0           0           0     ...       
1311                     0           0           0     ...       
1312                     0           0           0     ...       

      ticket=248744 L13  ticket=248749 L13  ticket=250647  ticket=27849  \
0                     0                  0              0             0   
1                     0                  0              0             0   
2                     0                  0              0             0   
3                     0                  0              0             0   
4                     0                  0              0             0   
5                     0                  0              0             0   
6                     0                  0              0             0   
7                     0                  0              0             0   
8                     0                  0              0             0   
9                     0                  0              0             0   
10                    0                  0              0             0   
11                    0                  0              0             0   
12                    0                  0              0             0   
13                    0                  0              0             0   
14                    0                  0              0             0   
15                    0                  0              0             0   
16                    0                  0              0             0   
17                    0                  0              0             0   
18                    0                  0              0             0   
19                    0                  0              0             0   
20                    0                  0              0             0   
21                    0                  0              0             0   
22                    0                  0              0             0   
23                    0                  0              0             0   
24                    0                  0              0             0   
25                    0                  0              0             0   
26                    0                  0              0             0   
27                    0                  0              0             0   
28                    0                  0              0             0   
29                    0                  0              0             0   
...                 ...                ...            ...           ...   
1283                  0                  0              0             0   
1284                  0                  0              0             0   
1285                  0                  0              0             0   
1286                  0                  0              0             0   
1287                  0                  0              0             0   
1288                  0                  0              0             0   
1289                  0                  0              0             0   
1290                  0                  0              0             0   
1291                  0                  0              0             0   
1292                  0                  0              0             0   
1293                  0                  0              0             0   
1294                  0                  0              0             0   
1295                  0                  0              0             0   
1296                  0                  0              0             0   
1297                  0                  0              0             0   
1298                  0                  0              0             0   
1299                  0                  0              0             0   
1300                  0                  0              0             0   
1301                  0                  0              0             0   
1302                  0                  0              0             0   
1303                  0                  0              0             0   
1304                  0                  0              0             0   
1305                  0                  0              0             0   
1306                  0                  0              0             0   
1307                  0                  0              0             0   
1308                  0                  0              0             0   
1309                  0                  0              0             0   
1310                  0                  0              0             0   
1311                  0                  0              0             0   
1312                  0                  0              0             0   

      ticket=28220 L32 10s  ticket=34218 L10 10s  ticket=36973 L83 9s 6d  \
0                        0                     0                       0   
1                        0                     0                       0   
2                        0                     0                       0   
3                        0                     0                       0   
4                        0                     0                       0   
5                        0                     0                       0   
6                        0                     0                       0   
7                        0                     0                       0   
8                        0                     0                       0   
9                        0                     0                       0   
10                       0                     0                       0   
11                       0                     0                       0   
12                       0                     0                       0   
13                       0                     0                       0   
14                       0                     0                       0   
15                       0                     0                       0   
16                       0                     0                       0   
17                       0                     0                       0   
18                       0                     0                       0   
19                       0                     0                       0   
20                       0                     0                       0   
21                       0                     0                       0   
22                       0                     0                       0   
23                       0                     0                       0   
24                       0                     0                       0   
25                       0                     0                       0   
26                       0                     0                       0   
27                       0                     0                       0   
28                       0                     0                       0   
29                       0                     0                       0   
...                    ...                   ...                     ...   
1283                     0                     0                       0   
1284                     0                     0                       0   
1285                     0                     0                       0   
1286                     0                     0                       0   
1287                     0                     0                       0   
1288                     0                     0                       0   
1289                     0                     0                       0   
1290                     0                     0                       0   
1291                     0                     0                       0   
1292                     0                     0                       0   
1293                     0                     0                       0   
1294                     0                     0                       0   
1295                     0                     0                       0   
1296                     0                     0                       0   
1297                     0                     0                       0   
1298                     0                     0                       0   
1299                     0                     0                       0   
1300                     0                     0                       0   
1301                     0                     0                       0   
1302                     0                     0                       0   
1303                     0                     0                       0   
1304                     0                     0                       0   
1305                     0                     0                       0   
1306                     0                     0                       0   
1307                     0                     0                       0   
1308                     0                     0                       0   
1309                     0                     0                       0   
1310                     0                     0                       0   
1311                     0                     0                       0   
1312                     0                     0                       0   

      ticket=392091  ticket=7076  ticket=L15 1s  
0                 0            0              0  
1                 0            0              0  
2                 0            0              0  
3                 0            0              0  
4                 0            0              0  
5                 0            0              0  
6                 0            0              0  
7                 0            0              0  
8                 0            0              0  
9                 0            0              0  
10                0            0              0  
11                0            0              0  
12                0            0              0  
13                0            0              0  
14                0            0              0  
15                0            0              0  
16                0            0              0  
17                0            0              0  
18                0            0              0  
19                0            0              0  
20                0            0              0  
21                0            0              0  
22                0            0              0  
23                0            0              0  
24                0            0              0  
25                0            0              0  
26                0            0              0  
27                0            0              0  
28                0            0              0  
29                0            0              0  
...             ...          ...            ...  
1283              0            0              0  
1284              0            0              0  
1285              0            0              0  
1286              0            0              0  
1287              0            0              0  
1288              0            0              0  
1289              0            0              0  
1290              0            0              0  
1291              0            0              0  
1292              0            0              0  
1293              0            0              0  
1294              0            0              0  
1295              0            0              0  
1296              0            0              0  
1297              0            0              0  
1298              0            0              0  
1299              0            0              0  
1300              0            0              0  
1301              0            0              0  
1302              0            0              0  
1303              0            0              0  
1304              0            0              0  
1305              0            0              0  
1306              0            0              0  
1307              0            0              0  
1308              0            0              0  
1309              0            0              0  
1310              0            0              0  
1311              0            0              0  
1312              0            0              0  

[1313 rows x 581 columns]

Build the training and testing datasets


In [12]:
from sklearn.cross_validation import train_test_split
titanic_target = titanic['survived']
titanic_data = titanic.drop(['name', 'row.names', 'survived'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(titanic_data, titanic_target, test_size=0.25, random_state=33)
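
As a rough illustration of what a 75/25 hold-out split does, the following sketch shuffles row indices with a fixed seed and slices off the test part. It is plain Python rather than scikit-learn's implementation, the helper name `holdout_split` is hypothetical, and rounding the test size up is an assumption made for this example. The shuffled order will not match the one produced by `train_test_split`.

```python
import math
import random

def holdout_split(n_rows, test_size=0.25, seed=33):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible between runs
    random.Random(seed).shuffle(indices)
    # Assumption for this sketch: round the test set size up
    n_test = int(math.ceil(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

# 1313 is the number of rows in the Titanic DataFrame printed above
train_idx, test_idx = holdout_split(1313)
print("{0} train rows, {1} test rows".format(len(train_idx), len(test_idx)))
```

The two index lists never overlap, which is the property that makes the test set a fair estimate of performance on unseen data.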

Let's see how a decision tree works with the current feature set.


In [13]:
from sklearn import tree
dt = tree.DecisionTreeClassifier(criterion='entropy')
dt = dt.fit(X_train, y_train)

In [14]:
import pydot, StringIO
dot_data = StringIO.StringIO()
tree.export_graphviz(dt, out_file=dot_data, feature_names=titanic_data.columns)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png('titanic.png')
from IPython.core.display import Image
Image(filename='titanic.png')


Out[14]:

In [15]:
from sklearn import metrics
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confussion_matrix=True):
    # Predict with the fitted classifier and print the requested metrics
    y_pred = clf.predict(X)
    if show_accuracy:
        print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n"
    if show_classification_report:
        print "Classification report"
        print metrics.classification_report(y, y_pred),"\n"
    if show_confussion_matrix:
        print "Confusion matrix"
        print metrics.confusion_matrix(y, y_pred),"\n"

In [16]:
from sklearn import metrics
measure_performance(X_test, y_test, dt, show_confussion_matrix=False, show_classification_report=False)


Accuracy:0.839 
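
To make the reported metrics concrete, here is a minimal sketch of how accuracy and a confusion matrix can be computed by hand. The labels are made-up toy values, not Titanic predictions, and the helper names `accuracy` and `confusion_counts` are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true))

def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Count matrix with true classes as rows and predicted classes as columns."""
    index = dict((label, i) for i, label in enumerate(labels))
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Toy labels, invented for this illustration
toy_true = [0, 0, 1, 1, 1, 0, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("Accuracy:{0:.3f}".format(accuracy(toy_true, toy_pred)))
print(confusion_counts(toy_true, toy_pred))
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total number of instances.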

Feature Selection

Working with a smaller feature set may lead to better results, so we want a way to find the best features algorithmically. This task is called feature selection, and it is a crucial step when we aim to get decent results with machine learning algorithms: if the features are poor, the results will be poor no matter how sophisticated the learning algorithm is.

Select only the 20% most important features, using a chi2 test. The chi2 statistic scores each feature by how strongly it depends on the target class.


In [17]:
from sklearn import feature_selection
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
print titanic_data.columns[fs.get_support()]
print fs.scores_[2]
print titanic_data.columns[2]


Index([u'age', u'embarked=Cherbourg', u'embarked=Southampton', u'pclass=1st', u'pclass=2nd', u'pclass=3rd', u'sex=female', u'sex=male', u'boat=1', u'boat=10', u'boat=11', u'boat=12', u'boat=13', u'boat=14', u'boat=14/12', u'boat=14/D', u'boat=15', u'boat=16', u'boat=2', u'boat=3', u'boat=4', u'boat=5', u'boat=5/7', u'boat=6', u'boat=7', u'boat=8', u'boat=9', u'boat=A', u'boat=B', u'boat=C', u'boat=D', u'home.dest=Aberdeen / Portland, OR', u'home.dest=Albany, NY', u'home.dest=Australia Fingal, ND', u'home.dest=Austria-Hungary / Germantown, Philadelphia, PA', u'home.dest=Ballydehob, Co Cork, Ireland New York, NY', u'home.dest=Bangkok, Thailand / Roseville, IL', u'home.dest=Barcelona, Spain / Havana, Cuba', u'home.dest=Bayside, Queens, NY', u'home.dest=Belgium  Montreal, PQ', u'home.dest=Belmont, MA', u'home.dest=Berne, Switzerland / Central City, IA', u'home.dest=Birkdale, England Cleveland, Ohio', u'home.dest=Bournmouth, England', u'home.dest=Bristol, Avon / Jacksonville, FL', u'home.dest=Brooklyn, NY', u'home.dest=Bryn Mawr, PA', u'home.dest=Calgary, AB', u'home.dest=Chelsea, London', u'home.dest=Chicago, IL', u'home.dest=Co Athlone, Ireland New York, NY', u'home.dest=Co Clare, Ireland Washington, DC', u'home.dest=Co Longford, Ireland New York, NY', u'home.dest=Cooperstown, NY', u'home.dest=Cornwall / Hancock, MI', u'home.dest=Deer Lodge, MT', u'home.dest=Denver, CO', u'home.dest=Detroit, MI', u'home.dest=Dowagiac, MI', u'home.dest=Duluth, MN', u'home.dest=England / Bennington, VT', u'home.dest=England Albion, NY', u'home.dest=England Brooklyn, NY', u'home.dest=England Oglesby, IL', u'home.dest=Finland / Minneapolis, MN', u'home.dest=Finland / Washington, DC', u'home.dest=Folkstone, Kent / New York, NY', u'home.dest=Green Bay, WI', u'home.dest=Greenwich, CT', u'home.dest=Guntur, India / Benton Harbour, MI', u'home.dest=Halifax, NS', u'home.dest=Harrisburg, PA', u'home.dest=Harrow, England', u'home.dest=Haverford, PA', u'home.dest=Haverford, PA / Cooperstown, NY', 
u'home.dest=Hessle, Yorks', u'home.dest=Hudson, NY', u'home.dest=India / Rapid City, SD', u'home.dest=Indianapolis, IN', u'home.dest=Italy Philadelphia, PA', u'home.dest=Lima, Peru', u'home.dest=Liverpool', u'home.dest=Liverpool, England Bedford, OH', u'home.dest=London  Vancouver, BC', u'home.dest=London /  East Orange, NJ', u'home.dest=London / Paris', u'home.dest=London, England Norfolk, VA', u'home.dest=Mt Airy, Philadelphia, PA', u'home.dest=New York, NY', u'home.dest=New York, NY / Ithaca, NY', u'home.dest=Norwich / New York, NY', u'home.dest=Paris / Haiti', u'home.dest=Paris, France', u'home.dest=Plymouth, Devon / Detroit, MI', u'home.dest=Rotherfield, Sussex, England Essex Co, MA', u'home.dest=Spain / Havana, Cuba', u'home.dest=St Louis, MO', u'home.dest=Sweden Winnipeg, MN', u'home.dest=Syria New York, NY', u'home.dest=Tuxedo Park, NY', ...], dtype='object')
41.2650346212
embarked=Cherbourg
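
The sketch below shows one way a chi2 score for a single non-negative feature can be computed: compare the feature total observed in each class with the total expected if the feature were independent of the class. This is a plain-Python illustration under that assumption, not scikit-learn's code; the helper name `chi2_score` and the toy data are hypothetical.

```python
def chi2_score(feature, target):
    """Chi-squared statistic between one non-negative feature and the class labels."""
    n = float(len(target))
    total = float(sum(feature))
    score = 0.0
    for c in sorted(set(target)):
        # Observed: sum of the feature values over the instances of class c
        observed = sum(x for x, y in zip(feature, target) if y == c)
        # Expected under independence: feature total times the class frequency
        expected = total * sum(1 for y in target if y == c) / n
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

# Toy data, invented for this illustration
toy_target = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 0, 0, 0, 0, 0]    # set almost only for class 1
uninformative = [1, 0, 1, 0, 1, 0, 1, 0]  # spread evenly over both classes

print("informative: {0:.2f}".format(chi2_score(informative, toy_target)))
print("uninformative: {0:.2f}".format(chi2_score(uninformative, toy_target)))
```

A feature that is spread evenly across the classes scores zero, while a feature concentrated in one class scores high, which is why a percentile cut on these scores keeps the features most associated with survival.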

Evaluate performance with the new feature set


In [18]:
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confussion_matrix=False, show_classification_report=False)


Accuracy:0.848 

Find the best percentile using cross-validation on the training set


In [19]:
from sklearn import cross_validation

percentiles = range(1, 100, 5)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_validation.cross_val_score(dt, X_train_fs, y_train, cv=5)
    #print i,scores.mean()
    results = np.append(results, scores.mean())

# np.argmax returns a single index (the first maximum), so it can index the list directly
optimal_percentil = np.argmax(results)
print "Optimal percentile of features:{0}".format(percentiles[optimal_percentil]), "\n"

# Plot percentile of features VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Percentile of features selected")
pl.ylabel("Cross-validation accuracy")
pl.plot(percentiles,results)
print "Mean scores:",results


Optimal percentile of features:6 

Mean scores: [ 0.83332303  0.87804576  0.87195424  0.86994434  0.87399505  0.86891363
  0.86992373  0.86991342  0.87195424  0.86991342  0.87194393  0.87398475
  0.86991342  0.87093383  0.86992373  0.86074005  0.86583179  0.86790353
  0.86891363  0.8648423 ]
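
The selection step itself is just an argmax over the mean scores. As a small sketch in plain Python, using the mean scores printed above (the variable names `mean_scores`, `best_index` and `best_percentile` are introduced here for illustration only):

```python
percentiles = list(range(1, 100, 5))

# Mean cross-validation scores copied from the output above
mean_scores = [0.83332303, 0.87804576, 0.87195424, 0.86994434, 0.87399505,
               0.86891363, 0.86992373, 0.86991342, 0.87195424, 0.86991342,
               0.87194393, 0.87398475, 0.86991342, 0.87093383, 0.86992373,
               0.86074005, 0.86583179, 0.86790353, 0.86891363, 0.8648423]

# Index of the highest mean score (the first one, if there are ties)
best_index = max(range(len(mean_scores)), key=mean_scores.__getitem__)
best_percentile = percentiles[best_index]

print("Best percentile: {0} (mean accuracy {1:.3f})".format(
    best_percentile, mean_scores[best_index]))
```

Because the result is a single integer rather than an array of indices, it can be used directly to index the list of percentiles.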

Evaluate our best percentile of features on the test set


In [20]:
# optimal_percentil may be a one-element array; convert it to a plain int before indexing
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=percentiles[int(optimal_percentil)])
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confussion_matrix=False, show_classification_report=False)


Accuracy:0.860 

In [21]:
print dt.get_params()


{'splitter': 'best', 'min_density': None, 'compute_importances': None, 'max_leaf_nodes': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'random_state': None, 'criterion': 'entropy', 'max_features': None, 'max_depth': None}

Compare the split criteria using cross-validation on the training set (see the next notebook on Model Selection)


In [22]:
dt = tree.DecisionTreeClassifier(criterion='entropy')
scores = cross_validation.cross_val_score(dt, X_train_fs, y_train, cv=5)
print "Entropy criterion accuracy on cv: {0:.3f}".format(scores.mean())
dt = tree.DecisionTreeClassifier(criterion='gini')
scores = cross_validation.cross_val_score(dt, X_train_fs, y_train, cv=5)
print "Gini criterion accuracy on cv: {0:.3f}".format(scores.mean())


Entropy criterion accuracy on cv: 0.879
Gini criterion accuracy on cv: 0.880
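
For reference, the two criteria measure the impurity of a node from its class counts. The following is a minimal sketch of both formulas in plain Python; the helper names `entropy` and `gini` and the example counts are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = float(sum(counts))
    return -sum((c / total) * math.log(c / total, 2) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a node with the given class counts."""
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Example class counts, invented for this illustration
for counts in ([5, 5], [9, 1], [10, 0]):
    print("{0}: entropy={1:.3f} gini={2:.3f}".format(
        counts, entropy(counts), gini(counts)))
```

Both measures are zero for a pure node and largest for an even class mix, so they tend to rank candidate splits similarly. That is consistent with the nearly identical cross-validation accuracies reported above.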

In [23]:
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confussion_matrix=False, show_classification_report=False)


Accuracy:0.863